-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added build and instructions for Pixi environment manager #414
Conversation
platform/build/make.inc.PIXI
Outdated
# git clone [email protected]:gafusion/gacode.git | ||
# pixi add "python~=3.12,<3.13" | ||
# pixi add git | ||
# pixi add mpich |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had built a conda based environment on the PPPL portal cluster at one point, and we had a problem with using more than 32 cores. See #96 . Are you in a position to try building this on a system that can handle more than 32 cores and testing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have access to a partition of the PSFC cluster with 64 cores per node. Let me build the code there and see if I cannot get it to use more than 32 cores on this cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have compiled it on the PSFC cluster using the PIXI instructions. It seems that the CGYRO regression tests return strange results with varying the number of cores. There is something wrong with the OpenBLAS package I used so OpenMP was not working for this test, I will debug that.
In the meantime, here are the regression results with 64 cores:
(py312_gacode) [aaronkho@eofe10 py312_gacode]$ cgyro -r -n 64 -nomp 1
REGRESSION TESTING: cgyro
reg01: ERROR - Regression data was not generated by simulation.
reg02: ERROR - Regression data was not generated by simulation.
reg03: ERROR - Regression data was not generated by simulation.
reg04: ERROR - Regression data was not generated by simulation.
reg05: PASS
reg06: ERROR - Regression data was not generated by simulation.
reg07: PASS
reg08: ERROR - Regression data was not generated by simulation.
reg09: ERROR - Regression data was not generated by simulation.
reg10: ERROR - Regression data was not generated by simulation.
reg11: ERROR - Regression data was not generated by simulation.
reg12: ERROR - Regression data was not generated by simulation.
reg13: ERROR - Regression data was not generated by simulation.
reg14: ERROR - Regression data was not generated by simulation.
reg15: ERROR - Regression data was not generated by simulation.
reg16: ERROR - Regression data was not generated by simulation.
reg17: PASS
reg18: ERROR - Regression data was not generated by simulation.
reg19: ERROR - Regression data was not generated by simulation.
reg20: ERROR - Regression data was not generated by simulation.
reg21: ERROR - Regression data was not generated by simulation.
reg22: ERROR - Regression data was not generated by simulation.
At 32 cores: reg04, reg12, reg13, reg14, reg18
fail.
At 16 cores: reg12
fails
At 8 cores: All regression tests pass
Please let me know if this is intended behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the OpenBLAS issue by adding a missing library (8d4cc83). However, CGYRO regression tests still fail in the same ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important to ensure OpenBLAS is single-threaded:
BINARY=64
USE_THREAD=0
USE_LOCKING=1
NO_SHARED=1
NO_CBLAS=1
NO_LAPACKE=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcandy The regression tests do not seem to change, regardless of using OpenBLAS library with single-thread or multi-thread. The only impact is that the computation time is longer with multi-thread, which makes sense with the quoted overhead problems with multi-thread OpenBLAS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pick one regression case that isn't working (say, reg12) and then run cgyro directly in the reg12 directory. We should get an idea where the code is failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcandy I ran everything with -n 32
and looked into reg12 and saw that it crashed with this error:
ERROR: (CGYRO) nc ( 24) not a multiple of coll atoa procs ( 32)
Inside the file out.cgyro.mpi, there is this output:
Parallelization and distribution diagnostics
nv: 256
nc: 24
GCD(nv,nc): 8
n_toroidal: 1
nt_loc: 1
[coll] [str] [NL] [NL] [NL] [coll] [str]
n_MPI nc_loc nv_loc n_split atoa[MB] atoa proc atoa proc ared proc
------ ------ ------ ------- -------- --------- --------- ---------
1 24 256 6144 0.10 1 1 1
2 12 128 3072 0.05 1 2 2
4 6 64 1536 0.02 1 4 4
8 3 32 768 0.01 1 8 8
My guess is that the regression test itself is not suitable to be parallelized across more than 24 cores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly,
reg04 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg13 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg14 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg18 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching over to -n 24
on reg12 yields other problems, giving the following error:
ERROR: (CGYRO) nv ( 256) not a multiple of coll atoa procs ( 24)
So it seems that -n 8
passes all regression checks purely by being the largest common denominator of all the CGYRO grids within these tests (which naturally extends to -n 4
and -n 2
).
Done |
Pixi environment / package manager allows seamless combination of conda and pypi packages, while requiring no system admin privileges. All recommended regression tests have passed on test build.